FIGURE 3.27: Effect of hyperparameters λ and τ on one- and two-stage training using 1-bit ResNet-18. (a) One-stage; (b) Two-stage.
which is termed the ApproxSign function and is used for the backpropagation gradient calculation of the activation. Compared to the traditional STE, ApproxSign has a shape similar to that of the original binarization function sign, so the activation gradient error can be controlled to some extent. Similarly, CBCN [149] applies an approximate function to address the gradient mismatch caused by the sign function. MetaQuant [38] introduces meta-learning to learn the gradient error of the weights with a neural network. IR-Net [196] includes a self-adaptive Error Decay Estimator (EDE) that reduces the gradient error during training by considering the different requirements of different training stages and balancing the update ability of the parameters against the reduction of gradient error. RBNN [140] proposes a training-aware approximation of the sign function for gradient backpropagation.
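To make the contrast concrete, the following is a minimal PyTorch-style sketch, an illustrative re-implementation rather than the code released with any of these methods, of a clipped STE backward versus an ApproxSign-style piecewise-quadratic backward for the sign function; the clipping range [-1, 1] and the surrogate's exact form follow the usual Bi-Real Net formulation and are assumed here for illustration.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign forward; clipped straight-through estimator (STE) backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass the incoming gradient through unchanged inside [-1, 1], zero it outside.
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)


class BinarizeApproxSign(torch.autograd.Function):
    """Sign forward; ApproxSign-style piecewise-quadratic surrogate backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of the piecewise-quadratic surrogate:
        # 2 + 2x on [-1, 0), 2 - 2x on [0, 1), and 0 elsewhere.
        surrogate = torch.where(x < 0, 2 + 2 * x, 2 - 2 * x).clamp(min=0)
        return grad_out * surrogate
```

In this sketch, replacing torch.sign(a) with BinarizeApproxSign.apply(a) keeps the binary forward pass but substitutes the surrogate derivative for the (almost everywhere zero) true gradient of sign, which is what controls the activation gradient error.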
In summary, prior art focuses on approximating the gradient derived from $\frac{\partial b_a}{\partial a_{i,j}}$ or $\frac{\partial b_w}{\partial w_{i,j}}$. Unlike these approaches, ours addresses the gradient approximation from a different perspective, i.e., the gradient from $\frac{\partial G}{\partial w_{i,j}}$. Our goal is to decouple $A$ and $w$ to improve the gradient calculation of $w$. RBONN manipulates $w$'s gradient through its bilinear coupling variable $A$ ($\frac{\partial G(A)}{\partial w_{i,j}}$). More specifically, our RBONN can be combined with the prior art by comprehensively considering $\frac{\partial L_S}{\partial a_{i,j}}$, $\frac{\partial L_S}{\partial w_{i,j}}$, and $\frac{\partial G}{\partial w_{i,j}}$ in the backpropagation process.
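As a rough illustration of this extra coupling term (the actual RBONN objective $G$ and its recurrent backtracking update are more involved than what is shown here), the sketch below assumes a per-channel reconstruction loss $G(A, w) = \|w - A \cdot \mathrm{sign}(w)\|^2$ with the scale $A$ fitted from $w$; comparing the gradient of $w$ with and without detaching $A$ isolates the contribution that flows through $\frac{\partial G(A)}{\partial w_{i,j}}$.

```python
import torch

def reconstruction_loss(w, A):
    # Illustrative bilinear reconstruction G(A, w) = ||w - A * sign(w)||^2,
    # with a per-output-channel scale A of shape (out_channels, 1).
    return ((w - A * torch.sign(w)) ** 2).sum()

# Toy latent real-valued weights (out_channels=8, fan_in=16); shapes are assumed.
w = torch.randn(8, 16, requires_grad=True)

# Per-channel scale expressed as a function of w (a simple |w|-mean fit).
A = w.abs().mean(dim=1, keepdim=True)

# Gradient when A is detached: the coupling path dG(A)/dw is discarded.
g_detached, = torch.autograd.grad(reconstruction_loss(w, A.detach()), w)

# Gradient when the dependence of A on w is kept: includes dG(A)/dw.
g_full, = torch.autograd.grad(reconstruction_loss(w, A), w)

print((g_full - g_detached).abs().max())  # nonzero gap = contribution of the coupling term
```

A nonzero difference shows that treating $A$ as a constant (the detached case) silently discards the coupling term, which is exactly the part of $w$'s gradient that RBONN manipulates.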
3.8.4 Ablation Study
Hyperparameters λ and τ. The most important hyperparameters of RBONN are λ and τ, which control the proportion of $L_R$ and the backtracking threshold in the recurrent bilinear optimization. On ImageNet with 1-bit ResNet-18, we evaluate the effect of λ and τ under both one- and two-stage training. The performance of RBONN is shown in Fig. 3.27, where λ ranges from 1e−3 to 1e−5 and τ ranges from 1 to 0.1. As observed, as λ decreases, performance first improves and then drops sharply. The same trend emerges when we increase τ in both implementations. As shown in Fig. 3.27, when λ is set to 1e−4 and τ is set to 0.6, the 1-bit ResNet-18 produced by our RBONN achieves the best performance. As